Project 2 - Page Rank

Assigned: Jan 31 2008. Due:Monday, Feb 11 2008, 11:59 PM
Partners: You may work alone or with a partner. No groups bigger than 2 please.
Turnin: Your code and writeup before the due data. Turnin here

Introduction:

Page Rank is an algorithm that operates on a link graph assigning to each node of the graph a numerical weight. It's purpose is to associate with every node in the graph a relative 'importance.' Page Rank is an iterative algorithm that runs on a graph with its nodes initialized to a certain set of values until the values converge (ie - stop changing).

The relevant citation is as follows:

For more information on the Page Rank algorithm you will need to read the paper: http://infolab.stanford.edu/~backrub/google.html (Section 2.1)

Your Goal:

Implement PageRank using map reduce, turn Wikipedia into a giant graph, run PageRank on said graph, run it several more times (ideally until the values converge), return (in a humanly parsable sort of way) the PageRank of all the articles. You will also need to implement at least one extension.

Data Set:

As your final data set, you will use a crawl of wikipedia articles. You should already be familiar with it from project 1, but here is a refresher. The data set is located on the DFS at /shared/wikipedia (roughly 12 GB of text in 193 files each sized roughly 64MB). Each wikipedia article is on its own line. A typical line of the data set (article) will start as follows:

Notice that the title of the article is delimited by an XML title tag, and the text of the article is delimited by an XML text tag. The links in the article to other wikipedia articles are delimited by "[[" and "]]". Also, certain links will look like this: "[[Longer Real Name|Short Name]]". These links appear in the article with "Short Name" but link to an article titled "Longer Real Name".

A Possible Outline:

  1. graphBuilder - Construct a link graph out of wikipedia. Make sure this is space efficient - what kind of graph representation can you use? You may also find SequenceFileOutputFormat useful for this.
  2. pageRankIter - One iteration updates the page rank values for your page rank graph.
  3. pageRankViewer - Returns the page ranks for wikipedia articles in a human readable way. (For example converting from SequenceFileInputFormat to TextOutpuFormat and sorting on page rank.)

Extensions:

You must implement at least one extension, either from the list below or your own, and you should feel free to implement more!

Hints:

Write-Up:

  1. How long do you think this project will take you to finish?
  2. How much time did you actually spend on the project?
  3. Acknowledge any assistance you received from anyone except assigned course readings and the course staff.
  4. Explain why it's ok to use a combiner class with the PageRank algorithm.
  5. What are the ten pages with highest rank in the provided Wikipedia corpus?
  6. Describe any extensions you've implemented.